Development and validation of deep learning models for identifying the brand of pedicle screws on plain spine radiographs

Abstract
Background: In spinal revision surgery, previous pedicle screws (PS) may need to be replaced with new implants. Failure to accurately identify the brand of PS‐based instrumentation preoperatively may increase the risk of perioperative complications. This study aimed to develop and validate an optimal deep learning (DL) model to identify the brand of PS‐based instrumentation on plain radiographs of the spine (PRS) using anteroposterior (AP) and lateral images.
Methods: A total of 529 patients who received PS‐based instrumentation from seven manufacturers were enrolled in this retrospective study. The postoperative PRS were gathered as ground truths. The training, validation, and testing datasets contained 338, 85, and 106 patients, respectively. YOLOv5 was used to crop out the screws' trajectory, and the EfficientNet‐b0 model was used to develop single models (AP, Lateral, Merge, and Concatenated) based on the different PRS images. The ensemble models were different combinations of the single models. Primary outcomes were the models' performance in accuracy, sensitivity, precision, F1‐score, kappa value, and area under the curve (AUC). Secondary outcomes were the relative performance of models versus human readers and external validation of the DL models.
Results: The Lateral model had the most stable performance among single models. The discriminative performance was improved by the ensemble method. The AP + Lateral ensemble model had the most stable performance, with an accuracy of 0.9434, F1‐score of 0.9388, and AUC of 0.9834. The performance of the ensemble models was comparable to that of experienced orthopedic surgeons and superior to that of inexperienced orthopedic surgeons. External validation revealed that the Lat + Concat ensemble model had the best accuracy (0.9412).
Conclusion: The DL models demonstrated stable performance in identifying the brand of PS‐based instrumentation based on AP and/or lateral images of PRS, which may assist orthopedic spine surgeons in preoperative revision planning in clinical practice.


| INTRODUCTION
Pedicle screw (PS)-based instrumentation is a commonly used internal fixation device for the treatment of spinal degenerative disease, deformities, tumors, and fractures. 1 However, symptomatic adjacent segment degeneration 2 and failed back surgery syndrome 3 are common reasons for revision surgery. In spinal revision surgery, previous implants may need to be removed and replaced with new implants.
Hence, orthopedic surgeons must accurately identify the brand of the existing implants and gather the appropriate surgical equipment for implant removal, since the universal removal set is expensive and may not be available in all hospitals. Failure to accurately identify PS-based instrumentation preoperatively may increase the surgical time and the risk of perioperative complications.
In clinical practice, implants are typically identified using plain radiographs of the spine (PRS). [5][6][7] Additionally, numerous studies suggested the potential of DL models to recognize knee and hip arthroplasties, [8][9][10] and cervical plating systems. 11,12 Yang et al. 13 reported that a variety of DL models are effective for one-segment spinal implant identification, yielding 76.0%-98.7% precision and 72.0%-98.4% recall; however, the performance of DL models in identifying spinal implants in multisegment fixation has not yet been investigated. While DL models have been used to identify the shaft of PS in the PRS 14 and the surrounding pedicle anatomy in CT scans, 15 these studies did not address the ability of DL models to identify the device manufacturer. Moreover, the generalizability of the ground truth plays an important role in the performance of DL models. 7
We hypothesized that the DL model may have stable performance in identifying PS-based instrumentation in the PRS and that the ground truth of different images on the PRS may affect the DL model performance. The objectives of this study were as follows: (1) to develop various DL models based on the different ground truths of PRS on anteroposterior (AP) and lateral images and to evaluate their performance in identifying different brands of PS-based instrumentation; (2) to investigate the effect of AP or lateral PRS images on the performance of the DL models; (3) to determine whether ensemble methods improve the model's performance and validate the optimal model; and (4) to compare the performance of our models with human readers and assess the performance of the DL models via external validation.
Taipei City, Taiwan), (3) Gezen (BioLife Medical Device Inc, Hsinchu City, Taiwan), (4) CDH (CDM8; Medtronic, Minneapolis, MN, USA), (5) Expedium (DePuy Synthes Inc., West Chester, PA, USA), (6) NOVA (BAUI Biotech Co., Ltd., New Taipei City, Taiwan), and (7) Xia 3 (Stryker Spine, Allendale, NJ, USA) (Figure 1).

| Plain radiography technique
The radiography machine used a high-voltage generator (UD150B-40; Shimadzu Corp., Kyoto, Japan) with a voltage of 94 kVp and an average current of 56 mAs for 360 ms. Computer software was used to investigate instrumentation on PRS in the AP and lateral projections (Smart Viewer 3.2; Taiwan Electronic Data Processing Corp., Taipei City, Taiwan).

| Development of deep learning models
The pre-trained You Only Look Once version 5 (YOLOv5, arXiv) was used to identify and crop out the trajectory of PS in PRS to enhance the performance of models. Medical Artificial Intelligence Aggregator (MAIA) software (Muen Biomedical and Optoelectronic Technologist, Inc., Taipei City, Taiwan) was used for automated analysis of the medical images based on the structure of the built-in EfficientNet-b0 model pre-trained on ImageNet (Figure 2). 16,17 The graphic proces-
the analysis type (i.e., classification, regression, or grading). The images were then resized to 256 × 256 with 3 color channels, and Horizontal Flip and Rotate methods were used for data augmentation to prevent over-fitting. 18 The batch size was decided according to the memory consumption. The loss function was calculated by cross-entropy loss or mean square error, depending on the type of analysis conducted. An Adam optimizer was used to minimize the loss. 19 The learning rate was tuned using the one-cycle cosine annealing strategy. 20,21
Under the framework of MAIA, the AP model, Lateral (Lat) model, Concatenated (Concat) model, and Merge model were developed based on different ground truths of AP and/or lateral images. In other words, all four models employed the same EfficientNet-b0 architecture (pre-trained on ImageNet) but were fine-tuned on different image datasets. The AP model was fine-tuned on the AP images of PRS, and the Lat model was fine-tuned on the lateral images of PRS. In addition, the AP and lateral images of the PRS were first combined to form a single concatenated image (shape: 256 × 256 × 3); the Concat model was fine-tuned on concatenated images of PRS. Finally, both AP and lateral images without concatenation were simultaneously used to fine-tune the Merge model. Therefore, the Merge model produced three predictions based on AP images, lateral images, and dual images (both AP and lateral images). The ensemble models were constructed using logistic regression, assembling the predicted probabilities from different combinations of single models.
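As a rough illustration of the ensemble step described above, the sketch below stacks the class probabilities from two single models (standing in for the AP and Lat models) and fits a multinomial logistic regression on the combined vector. It is not the MAIA implementation: the toy probabilities, the three-class setup (the study has seven brands), the learning rate, the number of steps, and all function names are assumptions chosen for illustration. In practice, a library routine such as scikit-learn's LogisticRegression would normally replace the hand-written gradient descent.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of class scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, features):
    # weights holds one row per class; the last entry of each row is the bias.
    x = features + [1.0]
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def mean_cross_entropy(weights, X, y):
    # Average negative log-likelihood of the true labels.
    return -sum(math.log(predict_proba(weights, x)[label])
                for x, label in zip(X, y)) / len(X)

def fit_stacker(X, y, n_classes, lr=0.5, steps=300):
    # Full-batch gradient descent on the multinomial logistic loss.
    n_features = len(X[0]) + 1
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(steps):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x, label in zip(X, y):
            probs = predict_proba(weights, x)
            xb = x + [1.0]
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                for j in range(n_features):
                    grad[k][j] += err * xb[j] / len(X)
        for k in range(n_classes):
            for j in range(n_features):
                weights[k][j] -= lr * grad[k][j]
    return weights

# Hypothetical predicted probabilities from two single models (3 toy brands).
ap_probs = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.2, 0.6, 0.2],
            [0.4, 0.5, 0.1], [0.1, 0.2, 0.7], [0.3, 0.3, 0.4]]
lat_probs = [[0.8, 0.1, 0.1], [0.6, 0.2, 0.2], [0.1, 0.8, 0.1],
             [0.2, 0.7, 0.1], [0.1, 0.1, 0.8], [0.2, 0.2, 0.6]]
labels = [0, 0, 1, 1, 2, 2]

# The ensemble input is the concatenation of the single-model probabilities.
stacked = [ap + lat for ap, lat in zip(ap_probs, lat_probs)]

initial_loss = mean_cross_entropy([[0.0] * 7 for _ in range(3)], stacked, labels)
weights = fit_stacker(stacked, labels, n_classes=3)
final_loss = mean_cross_entropy(weights, stacked, labels)

ensemble_probs = [predict_proba(weights, x) for x in stacked]
ensemble_pred = [p.index(max(p)) for p in ensemble_probs]
```

In a real workflow, the stacker would be fitted on predictions for data held out from the single models' training and then evaluated on the independent testing dataset, so that the ensemble weights are not tuned on the same images used for fine-tuning.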

| Datasets for training, validating, and testing
The images of the 529 patients were divided into three groups: training dataset (n = 338), testing dataset (n = 106), and validation dataset (n = 85). The patient groups were stratified by brand and presence of crosslink, which ensured similarity in the ratios of different brands and in the presence of a crosslink in the training, validation, and testing datasets (Table 1). Only the training dataset was used to calculate the gradients and update the model parameters.
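The stratified division described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the study: the split fractions (0.64, 0.16, 0.20) are approximations derived from the reported counts (338, 85, and 106 of 529), and the record fields, the brand names "A" to "C", the seed, and the function name are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(records, fractions=(0.64, 0.16, 0.20), seed=0):
    """Split records into train/validation/test within each (brand, crosslink) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["brand"], rec["crosslink"])].append(rec)

    train, val, test = [], [], []
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        # Whatever remains in the stratum goes to the testing set.
        test.extend(group[n_train + n_val:])
    return train, val, test

# Hypothetical cohort: 3 brands x crosslink yes/no, 25 patients per stratum.
brands = ["A", "B", "C"]
records = [
    {"patient_id": i, "brand": brands[i % 3], "crosslink": (i // 3) % 2 == 0}
    for i in range(150)
]
train, val, test = stratified_split(records)
```

Splitting inside each stratum, rather than over the whole cohort, is what keeps the brand and crosslink ratios similar across the three datasets.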
The validation dataset was used to evaluate the model during each phase of the training process, and the model with the lowest validation loss was selected. Finally, the selected model was evaluated using the testing dataset, which was kept completely independent from the training process. In a brand-based evaluation, all metrics except accuracy were calculated based on each device type, with one type considered positive and all the others considered negative. In an overall manner, the macro-average and micro-average were each calculated.
aggregating the results of all brands to define true positive, false positive, true negative, and false negative, which were used to calculate metrics.
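The brand-based (one-vs-rest) and overall (macro- and micro-averaged) evaluations described in this section can be sketched with a small confusion matrix. The matrix values, the three-class setup, the row/column convention (rows as true labels, columns as predictions), and the function names are assumptions for illustration only and are not data from this study.

```python
def one_vs_rest_counts(confusion, k):
    """Return (tp, fp, fn, tn) for class k; rows are true labels, columns are predictions."""
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(confusion[i][k] for i in range(n)) - tp
    total = sum(sum(row) for row in confusion)
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_micro(confusion):
    n = len(confusion)
    counts = [one_vs_rest_counts(confusion, k) for k in range(n)]
    # Brand-based evaluation: one class positive, all others negative.
    per_class = [precision_recall_f1(tp, fp, fn) for tp, fp, fn, _ in counts]
    # Macro-average: mean of the brand-based values.
    macro = tuple(sum(m[i] for m in per_class) / n for i in range(3))
    # Micro-average: pool the counts of all brands, then compute the metrics.
    tp_sum = sum(c[0] for c in counts)
    fp_sum = sum(c[1] for c in counts)
    fn_sum = sum(c[2] for c in counts)
    micro = precision_recall_f1(tp_sum, fp_sum, fn_sum)
    return per_class, macro, micro

# Hypothetical 3-brand confusion matrix (25 test cases in total).
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 4],
]
per_class, macro, micro = macro_micro(confusion)
```

For single-label multiclass problems such as this one, the micro-averaged precision and sensitivity both equal the overall accuracy (here 21/25 = 0.84), whereas the macro-average weights every brand equally regardless of how many patients it has.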

| DL model evaluation and statistical analysis
In addition to the numeric metrics mentioned above, MAIA also reported graphic illustration of a confusion matrix, receiver operating characteristic (ROC) curve, and a gradient-weighted class activation map (Grad-CAM). 22 Grad-CAM was used to evaluate the heatmap for evidence that the model recognized the discriminative features of instrumentations, as indicated by a color transition from blue to red.
TABLE 3 Brand-based evaluation of ensemble models, regardless of the presence of crosslinks.
To evaluate the effect of crosslinks on model performance, numeric metrics based on PRS were calculated separately with or without crosslinks.

| Comparison of the performance between human readers and DL models
To compare the performance between our DL models and human readers, the AP and Lat images of PRS of 27 patients not included in our dataset were randomly selected from the included 529 patients using the randomization program. 23 An accurate illustration of each implant was provided for readers beforehand (Figure 1). The six human readers included one medical student, one orthopedic resident, one spine fellow, one general orthopedic surgeon, and two orthopedic spine surgeons. Moreover, five additional orthopedic surgeons (Readers 7-11) from another medical center were invited to participate in the test using the same datasets.

| Evaluation of DL models by external validation
For external validation, we obtained a dataset from another medical institution that used a different plain radiographic technique; these images were from patients bearing the same seven brands of screws (n = 31).

| RESULTS
Of the MAIA models, the Lat model had the most stable performance (Table 2). Of the ensemble models, the AP + Lat ensemble model exhibited the most stable performance (Table 3). The performance of the ensemble models was superior to that of the MAIA models (Table 4).
To investigate whether the presence of a crosslink influenced the performance of the DL models, we analyzed the performance of the model based on the PRS, with or without the crosslink. Both MAIA (Table S1) and ensemble models (Table S2) performed better when a crosslink was included.
Results of the analysis of the confusion matrix and ROC curve in the MAIA models, regardless of crosslink, are shown in Figure 3 and
The DL models focused on the discriminative regions of either screw pitch or crosslink to correctly classify PS-based instruments (Figure 5, Figure S5).
In the performance comparison between human readers and the DL models, the accuracy among human readers ranged from 0.37 to 0.89 (Table 5). The least accurate performance (0.37) was that of a medical student. In contrast, the average accuracy of four attending orthopedic spine surgeons was 0.823 ± 0.047. Test completion required an average of 752 ± 263 s (range: 587-1250 s) for human readers and 3 s for all models. The ensemble models achieved an accuracy of 0.89-1.00. The performance of these ensemble models was not inferior to that of experienced orthopedic spine surgeons.
TABLE 4 Comparison between single and ensemble models, regardless of the presence of crosslinks.
Regarding external validation (n = 31), the accuracy of the Lat model was 0.8824 (Table S3), and the accuracy of the Lat + Concat ensemble model was 0.9412 (Table S4). The testing model for automated identification of PSs is available at https://140.136.158.62/web_VF/x-ray-ps.html.

| DISCUSSION
In this study, we developed and validated DL models to identify PS-based instrumentation. Our results revealed that using the lateral image as the ground truth resulted in a more stable performance by our DL models; using the ensemble method also improved results.
The performance of the ensemble models was not inferior to that of experienced orthopedic spine surgeons. Taken together, these results suggest that these improved DL models can be an alternative means to identify PS-based instrumentation on PRS in clinical practice.
A DL model has been used to identify 15 types of cervical plating systems with 85.8% accuracy in the top-1 model based on 402 smartphone images. 11 Another DL model was able to identify 9 types of cervical plating systems with an accuracy of 91.5% in the top-1 model based on 321 PRS. 12 The above-mentioned studies used the same three brands of cervical plating systems (Medtronic Atlantis Vision, Depuy Synthes CSLP, and Depuy Synthes Skyline). 11,12 However, different ground truths were used; one was based on smartphone images, 11 and the other was based on PRS. 12 Consistently, the use of the top-1 statistical method achieved good discriminative performance in this study. We believe that our use of YOLOv5 to crop out the screw trajectory before brand identification and our ensemble analysis underlie these results.
AP images of PRS were often used as ground truths in DL models for identifying the implant design in different anatomic locations such as the cervical spine, knee, and hips. 8,11 However, in the present study, we found that the ground truth of the lateral image provided a more stable result in single models. This phenomenon may in part result from the fact that different implants are designed for different types of anatomic fixation. For example, cervical plating is fixed at the anterior vertebrae, and the whole construct can be easily visualized on an AP image, as with knee and hip arthroplasties. 11,12 In PS-based instrumentation, the trajectory is placed along the pedicle, and screw constructs at the neck and body may not be as clearly visible on AP images due to interference from the screw head and connecting rod.
Different PS systems may have distinct constructs (e.g., cylindrical vs. conical cores) 24 or differences in pitch, tip, and crosslink. Theoretically, the entire PS construct can be easily visual-
Clinical investigations have reported excellent accuracy of DL models in discriminating hip arthroplasties using different models, achieving 99.6%-100% accuracy. 8,10,25 One study 11 reported an accuracy of 94.4% in identifying 15 different cervical plating implants.
Studies using DL models to identify hip arthroplasties achieved an ROC of 0.98 9 to 0.99 8,25 for discrimination, and accuracy reached 100%. 10
The open-access website, Implant Identifier, 26 automatically identifies several arthroplasties of the hip, knee, elbow, shoulder, ankle, and wrist. 8 However, this web application has not been used to identify spine implants such as cervical plating and PS-based systems, despite the recent increase in the number of spine fusion surgeries performed. 27,28
The present study found comparable predictive performance between ensemble models and experienced orthopedic surgeons, which is in agreement with the result of a meta-analysis. 29 The potential of DL models as supplementary diagnostic tools to improve the diagnostic accuracy of clinicians has been demonstrated. 6,30,31 With the assistance of a DL model, the incidence of misinterpretation of radiologic images is reduced by 47.0%. 6 DL models not only help to improve diagnostic accuracy but also speed up diagnosis, which is extremely important for emergency medicine clinicians. 6,31 Moreover, DL models as supplementary diagnostic tools may help clinicians with limited training in musculoskeletal imaging to enhance fracture detection accuracy. 30 The current findings also suggest that the ensemble models may help inexperienced orthopedic surgeons to identify the brands of the existing implants.
While the performance of our models in identifying PS-based instrumentations is encouraging, these results are limited to the identification of only seven implant types. Brand or manufacturer preferences vary in different countries and hospitals. An expansion of the models to identify other brands of PS-based instrumentation is required to make them clinically useful. 28 Accordingly, we expect to collect and externally validate data from a multi-center study that expands the number of samples for each implant design analyzed to reach peak generalizability of the ground truth. 32 Using MAIA software for model training and testing allows us to efficiently include new datasets and re-train the models in an automated fashion.
Moreover, we plan to make the models available on the smartphone, the method commonly used clinically to communicate medical images. 33 Of the different methods used to identify the instrumentation brand preoperatively, the most reliable and efficient is to require preoperative registration via government or insurance policy.
This study protocol was approved by the Institutional Review Board of our institution (2022-05-007AC). The medical records of patients receiving PS-based instrumentation surgery from January 1, 2018, to June 30, 2020, at our institution were retrospectively reviewed. The exclusion criteria included mismatched brands between instrumentation and crosslinks (n = 25) and the presence of two brands of instrumentation in one PRS (n = 13). The corresponding postoperative PRS on AP and lateral images and the different brands of inserted implants were gathered as our ground truths. A total of 529 patients were included for the development of our DL models. Seven types of PS-based instrumentation commonly used in our clinical institution were considered as ground truths, including (1) Aspine SmartLoc Evolution (EVO) (Smartlock Omega; A-Spine Inc., New Taipei City, Taiwan), (2) Armstrong (Paonan Biotech [BIOMECH],
sing unit was NVIDIA GeForce RTX 2070. Image file formats in Digital Imaging and Communications in Medicine (DICOM) were imported into MAIA, which automatically adjusted the model structure to adapt to
FIGURE 1
Illustration of the seven enrolled pedicle screw-based instrumentations on plain radiographs of the spine in anteroposterior (AP) (left) and lateral images (middle), and the whole construct of the screw with head, neck, and body (right). (A) A-spine (EVO), (B) Armstrong, (C) CDH, (D) Expedium, (E) Gezen, (F) NOVA, (G) Xia 3.
FIGURE 2
Framework of Medical Artificial Intelligence Aggregator (MAIA) software and the structure of the built-in EfficientNet-b0 model. (A) Framework for MAIA software. (B) Structure of the built-in EfficientNet-b0 model.
The AP, Lat, and Concat models each provided only one prediction per patient. To evaluate the performance of the Merge model, we calculated the performance metrics based on three image datasets of PRS: the Merge model trained on AP images, the Merge model trained on lateral images, and the Merge model trained on dual images. Accuracy, precision, sensitivity, F1-score, interobserver reliability (kappa value), and area under the receiver operating characteristic curve (AUC) were calculated to evaluate the performance of the single and ensemble models. These metrics were calculated as either brand-based or overall evaluation.
The macro-average was computed by averaging the values of the brand-based evaluation. The micro-average was computed by
TABLE 1 Number of enrolled patients in training, validation, and test sets according to brand and the presence of crosslinks.

Figures
Figures S1 and S2. Results of the analysis of the confusion matrix and ROC curve in the ensemble models, regardless of crosslink, are shown in Figure 4 and Figures S3 and S4. To confirm the ability of the models to identify the features of the screws, we manually reviewed the Grad-CAMs as evaluated by all models and reported by MAIA.

FIGURE 3
Confusion matrices (left) and receiver operating characteristic (ROC) curves (right) of MAIA models regardless of the presence of crosslinks. Range of the area under the ROC curve (AUC): (A) AP model, 0.93-1; (B) Lat model, 0.95-1; (C) Concat model, 0.92-1; (D) Merge model trained on dual images, 0.89-1; (E) Merge model trained on AP images, 0.83-1; and (F) Merge model trained on lateral images, 0.93-1. The x- and y-axes in the confusion matrices represent the true labels and the predicted labels, respectively. Darker blue in the confusion matrices represents higher values. Lines are colored to indicate the following: blue, ROC curve of A-Spine; red, ROC curve of Armstrong; green, ROC curve of CDH; light blue, ROC curve of Expedium; lavender, ROC curve of Gezen; yellow-green, ROC curve of NOVA; dark blue, ROC curve of Xia 3; shocking pink dotted line, micro-average ROC curve; oriental blue dotted line, macro-average ROC curve.
ized from the head to the distal tip on a lateral image. The crosslink could be clearly visualized on an AP image, as evidenced by the Grad-CAM heatmaps. The use of crosslinks also improved the models' performance. Several factors may be responsible for this observation. First, the crosslink was still partly visible on the lateral image because of the non-parallel relationship between the beam of the X-ray projector and the PS. Second, a crosslink is used to connect both sides of the PS-based constructs, especially in two-level and multi-level fixations, in order to increase pullout strength. Thus, the performance of the lateral image-based DL model increases with the number of screws that can be seen on the lateral image of the PRS.
FIGURE 4
Confusion matrices (left) and receiver operating characteristic (ROC) curves (right) of ensemble models regardless of the presence of crosslinks. Range of the area under the ROC curve (AUC): (A) All ensemble models, 0.89-1; (B) AP + Lat ensemble model, 0.9-1; (C) AP + Lat + Concat ensemble model, 0.9-1; (D) AP + Lat + Merge ensemble model, 0.97-1; (E) Lat + Concat ensemble model, 0.92-1. The x- and y-axes in the confusion matrices represent the true labels and the predicted labels, respectively. Darker blue in the confusion matrices represents higher values. Lines are colored to indicate the following: blue, ROC curve of A-Spine; red, ROC curve of Armstrong; green, ROC curve of CDH; light blue, ROC curve of Expedium; lavender, ROC curve of Gezen; yellow-green, ROC curve of NOVA; dark blue, ROC curve of Xia 3; shocking pink dotted line, micro-average ROC curve; oriental blue dotted line, macro-average ROC curve.
In the present study, the screw body was red (very important) in all A-Spine and Expedium heatmaps, while the screw head was red (very important) in all CDH and NOVA heatmaps. Pedicle screws from different manufacturers have their own characteristics and unique constructs, which may help to partially explain why the DL models judged different locations of PS from different manufacturers based on our ground truths. However, the mechanisms underlying the above phenomenon remain to be investigated. The pronounced red intensity in Expedium's heatmap may be in part because PS manufactured by Armstrong, CDH, and Gezen were frequently misjudged as Expedium by MAIA models and ensemble models due to similar constructs.

FIGURE 5
Illustration of gradient-weighted class activation mapping (Grad-CAM) on plain radiographs of spines of anteroposterior or lateral images to identify brands.(A) Heatmap on plain radiographs of spines on the anteroposterior image for seven brands of screws.No crosslink was used in the NOVA group because the screw was designed for the minimally invasive approach.(B) Heatmap on plain radiographs of spines in the lateral image for seven brands of screws.